Car music project

Data can be found here: https://github.com/irecsys/CARSKit/tree/master/context-aware_data_sets

The dataset was originally used in the CARSKit project (described in the paper https://arxiv.org/pdf/1511.03780.pdf), but we'll try other approaches using well-known open-source machine learning frameworks.

In [1]:
# Purpose: to understand the data better
# 
# How?: Exploratory analysis with simple feature engineering
# 
# Steps: 
# ------
# 1) Load and clean the data
# 2) Join the data to produce one data frame
# 3) Do basic feature engineering
#   a) count basic metrics per user
#   b) use domain knowledge (to introduce more information)
#   c) one-hot encode categorical variables
#   d) transform variables, if necessary
# 4) Pair-plot numeric features
# 5) Make separate raw and aggregated datasets (grouped by the pair (user, song))
# 6) Try to segment users
# 7) Do dimensionality reduction
# 8) Make an entrée model
# 9) Export data
#
# Goals: a) to clean and engineer the data so it can be understood by most ML/DL tools 
#        b) to suggest which features we should focus on most
#
# Oskar Jarczyk (oskar.jarczyk@pja.edu.pl)

Prepare packages

Python standard library

In [2]:
import itertools

Data science toolkit

In [3]:
import numpy as np
import pandas as pd
In [4]:
import pickle
In [5]:
import pandas_profiling
In [6]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Machine learning

In [7]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA
In [8]:
from sklearn.manifold import TSNE
In [9]:
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
In [10]:
from apyori import apriori

Visualization packages

In [11]:
import matplotlib.pyplot as plt
import seaborn as sns
In [12]:
import plotly.express as px
In [13]:
from mpl_toolkits.mplot3d import Axes3D
In [14]:
%matplotlib inline

Read data

In [15]:
filename = 'Data_InCarMusic.xlsx'
In [16]:
sheetname = 'ContextualRating'
contextual_rating = pd.read_excel(filename, sheet_name=sheetname)
print(f'Successfully read {len(contextual_rating):,} rows from the "{sheetname}" Excel sheet')
print(f'Columns found: {contextual_rating.columns.tolist()}')
Successfully read 4,012 rows from the "ContextualRating" Excel sheet
Columns found: ['UserID', 'ItemID', ' Rating', 'DrivingStyle', 'landscape', 'mood', 'naturalphenomena ', 'RoadType', 'sleepiness', 'trafficConditions', 'weather']
In [17]:
sheetname = 'Music Track'
music_track = pd.read_excel(filename, sheet_name=sheetname)
print(f'Successfully read {len(music_track):,} rows from the "{sheetname}" Excel sheet')
print(f'Columns found: {music_track.columns.tolist()}')
Successfully read 139 rows from the "Music Track" Excel sheet
Columns found: ['id', ' album', ' artist', ' title', ' mp3url', ' description', ' imageurl', ' category_id']
In [18]:
sheetname = 'Music Category'
music_category = pd.read_excel(filename, sheet_name=sheetname, header=None)
print(f'Successfully read {len(music_category):,} rows from the "{sheetname}" Excel sheet')
print(f'Columns found: {music_category.columns.tolist()} (temporary column names)')
Successfully read 10 rows from the "Music Category" Excel sheet
Columns found: [0, 1] (temporary column names)

Data cleaning

Contextual ratings

In [19]:
contextual_rating.columns.to_series().groupby(contextual_rating.dtypes).groups
Out[19]:
{dtype('int64'): Index(['UserID', 'ItemID', ' Rating'], dtype='object'),
 dtype('O'): Index(['DrivingStyle', 'landscape', 'mood', 'naturalphenomena ', 'RoadType',
        'sleepiness', 'trafficConditions', 'weather'],
       dtype='object')}

Most of the columns are of 'string' (object) type, except for the integer columns: the "user id", the "song id", and the song rating

In [20]:
contextual_rating.dtypes
Out[20]:
UserID                int64
ItemID                int64
 Rating               int64
DrivingStyle         object
landscape            object
mood                 object
naturalphenomena     object
RoadType             object
sleepiness           object
trafficConditions    object
weather              object
dtype: object

Column names have leading/trailing whitespace in them; let's remove it using the str.strip() method from the Python standard library.

In [21]:
contextual_rating.columns = [x.strip() for x in contextual_rating.columns.tolist()]
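The same cleanup can also be done with the vectorized string accessor on the columns index; a minimal sketch on a toy frame with hypothetical padded column names:

```python
import pandas as pd

# toy frame imitating the padded column names found in the Excel sheets
df = pd.DataFrame({' Rating': [4], 'naturalphenomena ': ['night']})

# Index.str.strip() removes leading/trailing whitespace from every column name
df.columns = df.columns.str.strip()
```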

Music tracks

In [22]:
music_track.columns.to_series().groupby(music_track.dtypes).groups
Out[22]:
{dtype('int64'): Index(['id', ' category_id'], dtype='object'),
 dtype('O'): Index([' album', ' artist', ' title', ' mp3url', ' description', ' imageurl'], dtype='object')}

Again, some column names have whitespace in them; let's strip it

In [23]:
music_track.columns = [x.strip() for x in music_track.columns.tolist()]

Example songs (first 5)

In [24]:
music_track.head()
Out[24]:
id album artist title mp3url description imageurl category_id
0 248 \N B.B.King The Thrill is Gone Music/Blues/B.B.King - The Thrill is Gone.mp3 \N img/music.jpg 1
1 250 \N Mamie Smith Crazy Blues Music/Blues/Mamie Smith - Crazy Blues.mp3 \N img/music.jpg 1
2 251 \N Robert Johnson Hellhound On My Trail Music/Blues/Robert Johnson - Hellhound On My T... \N img/music.jpg 1
3 252 \N T-Bone Walker Stormy Monday Music/Blues/T-Bone Walker - Stormy Monday.mp3 \N img/music.jpg 1
4 253 \N Antonio Vivaldi Four Seasons Music/Classical/Antonio Vivaldi - Four Seasons... \N img/music.jpg 2

Some of the columns are not interesting or valuable at all. For example, the description attribute is always empty, and the reference to the (probable) album cover is invalid. The name of the mp3 file doesn't hold any information either. On the other hand, we could use the (artist, title) pair to fetch lyrics (and their sentiment), but that's out of scope for this exploratory analysis.

In [25]:
music_track = music_track[['id', 'artist', 'title', 'category_id']].rename(columns={'id': 'ItemID'})
In [26]:
music_track.dtypes
Out[26]:
ItemID          int64
artist         object
title          object
category_id     int64
dtype: object
In [27]:
# TODO: use Levenshtein distance to check for possible duplicates

Most common artists

In [28]:
most_common_artist = music_track.groupby('artist')['artist'].count()
most_common_artist.sort_values(ascending=False).head(12)
Out[28]:
artist
Unheilig           4
Lady GaGa          4
Ich + Ich          3
Black Eyed Peas    3
Timbaland          3
Rihanna            3
Jan Delay          2
Jason Derulo       2
OneRepublic        2
Culcha Candela     2
Bob Marley         2
Jay-Z              2
Name: artist, dtype: int64

Music category

In [29]:
music_category.columns = ['category_id', 'category name']
In [30]:
music_category.dtypes
Out[30]:
category_id       int64
category name    object
dtype: object

Feature engineering

In [31]:
contextual_rating.iloc[3].to_dict()
Out[31]:
{'UserID': 1001,
 'ItemID': 259,
 'Rating': 4,
 'DrivingStyle': nan,
 'landscape': nan,
 'mood': nan,
 'naturalphenomena': nan,
 'RoadType': nan,
 'sleepiness': nan,
 'trafficConditions': nan,
 'weather': 'snowing'}
In [32]:
music_track.iloc[10].to_dict()
Out[32]:
{'ItemID': 259,
 'artist': 'Hank Williams',
 'title': 'Im So Lonesome I Could Cry',
 'category_id': 3}

Join tables

Enrich main dataframe with information on the song

In [33]:
data = contextual_rating.merge(music_track, suffixes=('_ctx', '_track'), on='ItemID', how='inner')
In [34]:
if len(data) == len(contextual_rating):
    print(f'Success! Every music track is findable in the "music_track" dataframe')
else:
    raise AssertionError("Some tracks are missing.")
Success! Every music track is findable in the "music_track" dataframe
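Besides the length check after an inner join, pandas' merge can diagnose mismatches directly via indicator=True, which adds a '_merge' column telling where each row matched; a sketch with hypothetical miniature versions of the two sheets:

```python
import pandas as pd

# hypothetical miniature ratings and tracks tables; ItemID 999 has no track row
ratings = pd.DataFrame({'ItemID': [248, 250, 999], 'Rating': [4, 5, 3]})
tracks = pd.DataFrame({'ItemID': [248, 250], 'artist': ['B.B.King', 'Mamie Smith']})

# left join keeps unmatched ratings; '_merge' marks them as 'left_only'
merged = ratings.merge(tracks, on='ItemID', how='left', indicator=True)
unmatched = merged[merged['_merge'] != 'both']
```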

Add information about the category of the song

In [35]:
data = data.merge(music_category, on='category_id', how='inner')
In [36]:
if len(data) == len(contextual_rating):
    print(f'Success! Every music track has a proper music category assigned')
else:
    raise AssertionError("Some categories are missing.")
Success! Every music track has a proper music category assigned

Characterize users by their activity

Count the number of songs each user listened to

In [37]:
counts = data.groupby('UserID')['UserID'].count()
In [38]:
data['number_of_songs'] = data['UserID'].map(counts)
In [39]:
counts.hist(bins=20)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f27ad1ac828>

Count the number of unique songs each user listened to

In [40]:
counts = data.groupby('UserID')['ItemID'].nunique()
In [41]:
data['number_of_unique_songs'] = data['UserID'].map(counts)
In [42]:
counts.hist(bins=20)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f278af1cb00>

Count the number of unique music genres per user

In [43]:
counts = data.groupby('UserID')['category_id'].nunique()
In [44]:
data['number_of_unique_genres'] = data['UserID'].map(counts)
In [45]:
counts.hist(bins=10)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f278aea0940>
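The per-user counts above can also be attached in a single step with groupby().transform, which returns a result aligned to the original rows; a sketch on a hypothetical toy listening log:

```python
import pandas as pd

# toy log: user 1 rated two different songs, user 2 rated one song
df = pd.DataFrame({'UserID': [1, 1, 2], 'ItemID': [10, 11, 10]})

# transform broadcasts the per-group statistic back onto every row
df['number_of_songs'] = df.groupby('UserID')['UserID'].transform('count')
df['number_of_unique_songs'] = df.groupby('UserID')['ItemID'].transform('nunique')
```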
In [46]:
counts = data.groupby(['UserID', 'category name'])['category name'].count()

genre_counts = counts.groupby(level=0).apply(lambda x: x / float(x.sum()))
In [47]:
def second_dominant(c):
    """Return the genre(s) with the second-highest count/share in Series c.

    Ties are joined into one comma-separated string; returns None if
    there is no runner-up."""
    d, _max = {}, c.max()
    # invert the series: count/share -> comma-separated genre names
    for genre, value in c.items():
        if value in d:
            d[value] += ', ' + genre
        else:
            d[value] = genre
    del d[_max]  # drop the dominant value
    try:
        return d[max(d.keys())]
    except ValueError:  # d is empty -> only one distinct value existed
        return None
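Ignoring the tie handling above, the core idea reduces to "drop the dominant genre, then take the best remaining one"; a sketch with a hypothetical per-user genre-share series:

```python
import pandas as pd

# hypothetical genre shares for a single user
shares = pd.Series({'Pop music': 0.5, 'Rock music': 0.3, 'Jazz music': 0.2})

# idxmax finds the dominant label; dropping it and repeating yields the runner-up
second = shares.drop(shares.idxmax()).idxmax()
```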
In [48]:
data['dominant_genre'] = data.apply(lambda x: genre_counts[x['UserID']].idxmax(), axis=1)
In [49]:
data['second_dominant_genre'] = data.apply(lambda x: second_dominant(genre_counts[x['UserID']]), axis=1)
In [50]:
data['genre_ratio'] = data.apply(lambda x: genre_counts[x['UserID']][x['category name']], axis=1)
In [51]:
data['genre_ratio'].hist()
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f278adc0898>
In [52]:
# assign ratio of the most dominant music genre
data['main_genre_dominance'] = data.apply(lambda x: genre_counts[x['UserID']].max(), axis=1)
In [53]:
data['main_genre_dominance'].hist(bins=20)
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f278ad78160>

Music preference ("taste")

I believe there is no need to create a separate attribute characterizing music preferences, as they are already deducible from the whole set of attributes per se. We already have information on the most dominant and second most dominant genres, we know the dominance ratio, and there is an attribute for the total number of genres the user has listened to.

In [54]:
# Let's plot it and see if there are any correlations

Sensorial stimulus vs choice of music

In [55]:
calm_conditions = {'landscape': ['country side', ],
                   'mood': ['lazy', ],
                   'driving style': ['relaxed driving', ],
                   'natural phenomena': ['afternoon', 'night'],
                   'sleepiness': ['sleepy', ],
                   'traffic conditions': ['free road', ],
                   'weather': ['cloudy', ]}

stimulation_conditions = {'landscape': ['mountains/hills', ],
                          'mood': ['active', ],
                          'driving style': ['sport driving', ],
                          'natural phenomena': ['day time', 'morning'],
                          'sleepiness': ['awake', ],
                          'traffic conditions': ['lots of cars', 'traffic jam', ],
                          'weather': ['rainy', 'snowing', ]}

# map the condition keys above to the corresponding dataframe columns
condition_columns = {'landscape': 'landscape',
                     'mood': 'mood',
                     'driving style': 'DrivingStyle',
                     'natural phenomena': 'naturalphenomena',
                     'sleepiness': 'sleepiness',
                     'traffic conditions': 'trafficConditions',
                     'weather': 'weather'}

def score_conditions(row, conditions):
    """Count how many contextual attributes of the row match the given conditions."""
    points = 0.0
    for key, column in condition_columns.items():
        value = row[column]
        if pd.notna(value) and value in conditions[key]:
            points += 1.0
    return points
In [56]:
data['no_stimulus_points'] = data.apply(lambda x: score_conditions(x, calm_conditions), axis=1)
data['stimulus_points'] = data.apply(lambda x: score_conditions(x, stimulation_conditions), axis=1)
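In miniature, the scoring is just a membership check per contextual column, with missing values never matching; a toy example with a hypothetical subset of the calm conditions:

```python
import pandas as pd

# hypothetical subset of the calm conditions
calm = {'mood': ['lazy'], 'weather': ['cloudy'], 'sleepiness': ['sleepy']}

# one rating row: mood matches, weather doesn't, sleepiness is missing
row = pd.Series({'mood': 'lazy', 'weather': 'sunny', 'sleepiness': float('nan')})

# one point per contextual attribute whose value is a calm value
score = sum(1.0 for col, values in calm.items()
            if pd.notna(row[col]) and row[col] in values)
```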

Remove duplicated columns

In [57]:
data.drop('category_id', axis=1, inplace=True)

Data book

In [58]:
data.columns
Out[58]:
Index(['UserID', 'ItemID', 'Rating', 'DrivingStyle', 'landscape', 'mood',
       'naturalphenomena', 'RoadType', 'sleepiness', 'trafficConditions',
       'weather', 'artist', 'title', 'category name', 'number_of_songs',
       'number_of_unique_songs', 'number_of_unique_genres', 'dominant_genre',
       'second_dominant_genre', 'genre_ratio', 'main_genre_dominance',
       'no_stimulus_points', 'stimulus_points'],
      dtype='object')

UserID - ID which uniquely identifies a person driving & listening to track

ItemID - track ID which uniquely identifies a song

Rating - a 5-star rating given to a track

DrivingStyle - the driving style (while actually listening to the song)

Answers the questions: a) Are you driving in a relaxed mood? b) Is your driving style very sportive?

landscape - what type of landscape is behind the window?

Answers the questions: a) Are you driving along a coast line? b) Are you in the countryside? c) Are you driving between mountains or hills? d) Are you driving in a city?

mood - the current mood of the driver

Answers the questions: a) Are you feeling very active? b) Are you happy? c) Are you feeling lazy now? d) Do you feel sad?

naturalphenomena - what time of day it is

Answers the questions: a) Is it afternoon? b) Is it day time? c) Is it morning? d) Are you driving at night?

RoadType - what type of road the driver is currently traveling on

Answers the questions: a) Are you driving on a city street? b) Are you driving on a highway? c) Are you driving through serpentine curves?

sleepiness - how alert the driver feels (possible values: awake, sleepy)

trafficConditions - the current traffic situation (possible values: free road, lots of cars, traffic jam)

weather - the current weather (possible values: cloudy, rainy, snowing, sunny)

Pair plot

In [59]:
columns_of_interest = ['Rating', 'number_of_songs', 'stimulus_points', 'main_genre_dominance',
                       'number_of_unique_songs', 'number_of_unique_genres', 'category name']
In [60]:
sns.pairplot(data=data[columns_of_interest], hue="category name", dropna=True)
Out[60]:
<seaborn.axisgrid.PairGrid at 0x7f278ad04c18>

One-hot encoding of categorical variables

In [61]:
data.columns
Out[61]:
Index(['UserID', 'ItemID', 'Rating', 'DrivingStyle', 'landscape', 'mood',
       'naturalphenomena', 'RoadType', 'sleepiness', 'trafficConditions',
       'weather', 'artist', 'title', 'category name', 'number_of_songs',
       'number_of_unique_songs', 'number_of_unique_genres', 'dominant_genre',
       'second_dominant_genre', 'genre_ratio', 'main_genre_dominance',
       'no_stimulus_points', 'stimulus_points'],
      dtype='object')
In [62]:
data_encoded = pd.get_dummies(data.drop(['artist', 'title'], axis=1))
In [63]:
';  '.join(data_encoded.columns.tolist())
Out[63]:
'UserID;  ItemID;  Rating;  number_of_songs;  number_of_unique_songs;  number_of_unique_genres;  genre_ratio;  main_genre_dominance;  no_stimulus_points;  stimulus_points;  DrivingStyle_relaxed driving;  DrivingStyle_sport driving;  landscape_coast line;  landscape_country side;  landscape_mountains;  landscape_urban;  mood_active;  mood_happy;  mood_lazy;  mood_sad;  naturalphenomena_afternoon;  naturalphenomena_day time;  naturalphenomena_morning;  naturalphenomena_night;  RoadType_city;  RoadType_highway;  RoadType_serpentine;  sleepiness_awake;  sleepiness_sleepy;  trafficConditions_free road;  trafficConditions_lots of cars;  trafficConditions_traffic jam;  weather_cloudy;  weather_rainy;  weather_snowing;  weather_sunny;  category name_Blues music;  category name_Classical music;  category name_Country music;  category name_Disco music;  category name_Hip Hop music;  category name_Jazz music;  category name_Metal music;  category name_Pop music;  category name_Reggae music;  category name_Rock music;  dominant_genre_Blues music;  dominant_genre_Pop music;  dominant_genre_Rock music;  second_dominant_genre_Blues music;  second_dominant_genre_Blues music, Classical music, Disco music;  second_dominant_genre_Blues music, Classical music, Hip Hop music;  second_dominant_genre_Blues music, Disco music, Rock music;  second_dominant_genre_Blues music, Hip Hop music;  second_dominant_genre_Blues music, Metal music, Reggae music;  second_dominant_genre_Classical music;  second_dominant_genre_Classical music, Country music;  second_dominant_genre_Classical music, Country music, Disco music, Hip Hop music;  second_dominant_genre_Classical music, Country music, Disco music, Hip Hop music, Jazz music, Metal music, Rock music;  second_dominant_genre_Classical music, Disco music;  second_dominant_genre_Classical music, Disco music, Reggae music;  second_dominant_genre_Classical music, Hip Hop music, Rock music;  second_dominant_genre_Country music;  
second_dominant_genre_Country music, Disco music, Rock music;  second_dominant_genre_Country music, Jazz music, Rock music;  second_dominant_genre_Disco music;  second_dominant_genre_Disco music, Hip Hop music;  second_dominant_genre_Jazz music;  second_dominant_genre_Metal music'
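pd.get_dummies expands every object/categorical column into one indicator column per distinct value, while numeric columns pass through untouched; a minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({'weather': ['sunny', 'rainy'], 'Rating': [4, 2]})

# 'weather' becomes weather_rainy/weather_sunny; 'Rating' stays as-is
encoded = pd.get_dummies(df)
```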

Aggregating data - alternative set

In [64]:
data_aggr = pysqldf('''select UserID, ItemID, avg(Rating) as avg_rating,
                       avg(number_of_unique_songs) as number_of_unique_songs,
                       avg(number_of_unique_genres) as number_of_unique_genres,
                       avg(genre_ratio) as genre_ratio,
                       avg(main_genre_dominance) as main_genre_dominance,
                       sum(no_stimulus_points) as no_stimulus_points,
                       sum(stimulus_points) as stimulus_points,
                       sum(`DrivingStyle_relaxed driving`) as driving_style_relaxed_driving,  
                       sum(`DrivingStyle_sport driving`) as driving_style_sport_driving,  
                       sum(`landscape_coast line`) as landscape_coast_line,
                       sum(`landscape_country side`) as landscape_country_side,
                       sum(landscape_mountains) as landscape_mountains,  
                       sum(landscape_urban) as landscape_urban,
                       sum(mood_active) as mood_active,
                       sum(mood_happy) as mood_happy,
                       sum(mood_lazy) as mood_lazy,
                       sum(mood_sad) as mood_sad,
                       sum(naturalphenomena_afternoon) as natural_phenomena_afternoon,
                       sum(`naturalphenomena_day time`) as natural_phenomena_day_time,
                       sum(naturalphenomena_morning) as natural_phenomena_morning,
                       sum(naturalphenomena_night) as natural_phenomena_night,
                       sum(RoadType_city) as road_type_city,
                       sum(RoadType_highway) as road_type_highway,
                       sum(RoadType_serpentine) as road_type_serpentine,
                       sum(sleepiness_awake) as sleepiness_awake,
                       sum(sleepiness_sleepy) as sleepiness_sleepy,
                       sum(`trafficConditions_free road`) as traffic_conditions_free_road,
                       sum(`trafficConditions_lots of cars`) as traffic_conditions_lots_of_cars,
                       sum(`trafficConditions_traffic jam`) as traffic_conditions_traffic_jam,
                       sum(weather_cloudy) as weather_cloudy,
                       sum(weather_rainy) as weather_rainy,
                       sum(weather_snowing) as weather_snowing,
                       sum(weather_sunny) as weather_sunny,
                       avg(`category name_Blues music`) as category_name_blues,
                       avg(`category name_Classical music`) as category_name_classical,
                       avg(`category name_Country music`) as category_name_country,
                       avg(`category name_Disco music`) as category_name_disco,
                       avg(`category name_Hip Hop music`) as category_name_hip_hop,
                       avg(`category name_Jazz music`) as category_name_jazz,
                       avg(`category name_Metal music`) as category_name_metal,
                       avg(`category name_Pop music`) as category_name_pop,
                       avg(`category name_Reggae music`) as category_name_reggae,
                       avg(`category name_Rock music`) as category_name_rock, 
                       avg(`dominant_genre_Blues music`) as dominant_genre_blues,
                       avg(`dominant_genre_Pop music`) as dominant_genre_pop,
                       avg(`dominant_genre_Rock music`) as dominant_genre_rock,
                       avg(`second_dominant_genre_Blues music`) as second_dominant_blues,
                       avg(`second_dominant_genre_Blues music, Classical music, Disco music`) as second_dominant_blues_classical_disco,
                       avg(`second_dominant_genre_Blues music, Classical music, Hip Hop music`) as second_dominant_blues_classicalsecond_dominant_hh,
                       avg(`second_dominant_genre_Blues music, Disco music, Rock music`) as second_dominant_blues_disco_rock,
                       avg(`second_dominant_genre_Blues music, Hip Hop music`) as second_dominant_blues_hh,
                       avg(`second_dominant_genre_Blues music, Metal music, Reggae music`) as second_dominant_blues_metal_reggae,
                       avg(`second_dominant_genre_Classical music`) as second_dominant_classical,
                       avg(`second_dominant_genre_Classical music, Country music`) as second_dominant_classical_country,
                       avg(`second_dominant_genre_Classical music, Country music, Disco music, Hip Hop music`) as second_dominant_classical_country_disco_hh,
                       avg(`second_dominant_genre_Classical music, Country music, Disco music, Hip Hop music, Jazz music, Metal music, Rock music`) as second_dominant_classical_country_disco_hh_jazz_metal_rock,
                       avg(`second_dominant_genre_Classical music, Disco music`) as second_dominant_classical_disco,
                       avg(`second_dominant_genre_Classical music, Disco music, Reggae music`) as second_dominant_classical_disco_reggae,
                       avg(`second_dominant_genre_Classical music, Hip Hop music, Rock music`) as second_dominant_classical_hh_rock,
                       avg(`second_dominant_genre_Country music`) as second_dominant_country,
                       avg(`second_dominant_genre_Country music, Disco music, Rock music`) as second_dominant_country_disco_rock,
                       avg(`second_dominant_genre_Country music, Jazz music, Rock music`) as second_dominant_country_jazz_rock,
                       avg(`second_dominant_genre_Disco music`) as second_dominant_disco,
                       avg(`second_dominant_genre_Disco music, Hip Hop music`) as second_dominant_disco_hh,
                       avg(`second_dominant_genre_Jazz music`) as second_dominant_jazz,
                       avg(`second_dominant_genre_Metal music`) as second_dominant_metal
                       from data_encoded
                       group by UserID, ItemID;''')
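The same kind of per-(user, song) aggregation could also be expressed in pure pandas with named aggregation, avoiding the SQL round-trip; a minimal sketch on hypothetical toy columns:

```python
import pandas as pd

# toy encoded data: two ratings for the pair (1, 10), one for (2, 11)
df = pd.DataFrame({'UserID': [1, 1, 2], 'ItemID': [10, 10, 11],
                   'Rating': [4, 2, 5], 'mood_lazy': [1, 0, 1]})

# mean of ratings, sum of the one-hot context flag, one row per (user, song)
aggr = (df.groupby(['UserID', 'ItemID'], as_index=False)
          .agg(avg_rating=('Rating', 'mean'),
               mood_lazy_total=('mood_lazy', 'sum')))
```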
In [65]:
data_aggr.describe()
Out[65]:
UserID ItemID avg_rating number_of_unique_songs number_of_unique_genres genre_ratio main_genre_dominance no_stimulus_points stimulus_points driving_style_relaxed_driving ... second_dominant_classical_disco second_dominant_classical_disco_reggae second_dominant_classical_hh_rock second_dominant_country second_dominant_country_disco_rock second_dominant_country_jazz_rock second_dominant_disco second_dominant_disco_hh second_dominant_jazz second_dominant_metal
count 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 ... 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000 930.000000
mean 1019.140860 557.454839 2.373441 64.984946 8.132258 0.394187 0.589208 0.986022 1.088172 0.194624 ... 0.016129 0.016129 0.005376 0.016129 0.010753 0.010753 0.682796 0.012903 0.016129 0.005376
std 11.322104 216.087419 1.275900 48.325411 2.641218 0.271515 0.102080 0.868550 0.862301 0.404194 ... 0.126040 0.126040 0.073166 0.126040 0.103192 0.103192 0.465638 0.112918 0.126040 0.073166
min 1001.000000 248.000000 0.000000 1.000000 1.000000 0.010101 0.200000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1009.000000 281.000000 1.000000 20.000000 6.000000 0.080000 0.542857 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 1019.000000 692.000000 2.250000 70.000000 10.000000 0.542857 0.567308 1.000000 1.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
75% 1031.750000 729.750000 3.250000 116.000000 10.000000 0.600000 0.637931 1.000000 2.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000
max 1042.000000 762.000000 5.000000 139.000000 10.000000 1.000000 1.000000 5.000000 5.000000 2.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 68 columns

Profile all data

Un-aggregated data

In [66]:
report = data.profile_report(style={'full_width':True})
/home/oskar/.local/lib/python3.6/site-packages/pandas_profiling/model/correlations.py:124: UserWarning:

There was an attempt to calculate the cramers correlation, but this failed.
To hide this warning, disable the calculation
(using `df.profile_report(correlations={"cramers": False}`)
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'No data; `observed` has size 0.')

In [67]:
report
Out[67]:

In [68]:
f'Rejected variables due to high correlation with some other variables: {report.get_rejected_variables()}'
Out[68]:
"Rejected variables due to high correlation with some other variables: ['number_of_unique_songs']"

Aggregated data

In [69]:
report_aggr = data_aggr.profile_report(style={'full_width':True})
In [70]:
f'Rejected variables due to high correlation with some other variables: {report_aggr.get_rejected_variables()}'
Out[70]:
"Rejected variables due to high correlation with some other variables: ['genre_ratio']"

User segmentation

In [71]:
excluded = []
excluded_aggr = []
# id-columns
excluded.extend(['UserID', 'ItemID'])
excluded_aggr.extend(['UserID', 'ItemID'])
# correlated
excluded.extend(['number_of_unique_songs', ])
excluded_aggr.extend(['genre_ratio', ])

print(f'Excluded columns for normal dataset: {excluded},\n aggregated dataset: {excluded_aggr}')

mms, mms_aggr = MinMaxScaler(), MinMaxScaler()
data_transformed = mms.fit_transform(data_encoded.drop(excluded, axis=1))
data_transformed_aggr = mms_aggr.fit_transform(data_aggr.drop(excluded_aggr, axis=1))
Excluded columns for normal dataset: ['UserID', 'ItemID', 'number_of_unique_songs'],
 aggregated dataset: ['UserID', 'ItemID', 'genre_ratio']
In [72]:
sum_of_squared_distances = []
sum_of_squared_distances_aggr = []
k_range = range(1, 30)

for k in k_range:
    km = MiniBatchKMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

    
for k in k_range:
    km = MiniBatchKMeans(n_clusters=k)
    km = km.fit(data_transformed_aggr)
    sum_of_squared_distances_aggr.append(km.inertia_)
In [73]:
plt.figure(figsize=(20, 6))

plt.subplot(1, 2, 1)

plt.plot(k_range, sum_of_squared_distances_aggr, 'bx-')

plt.xlabel('value of [k]')
plt.ylabel('Sum of squared_distances')
plt.title('Elbow method for optimal value of [k] - Aggregated Dataset')

plt.subplot(1, 2, 2)

plt.plot(k_range, sum_of_squared_distances, 'bx-')

plt.xlabel('value of [k]')
plt.ylabel('Sum of squared_distances')
plt.title('Elbow method for optimal value of [k] - "Normal" dataset')

plt.show()

We used an "elbow method" to find an optimal number of clusters for the KMeans algorithm, that number is around k=10

In [74]:
km = MiniBatchKMeans(n_clusters=10)
km = km.fit(data_transformed)

km_aggr = MiniBatchKMeans(n_clusters=10)
km_aggr = km_aggr.fit(data_transformed_aggr)
In [75]:
labels = km.predict(data_transformed)
labels_aggr = km_aggr.predict(data_transformed_aggr)
In [76]:
data_labeled = data.copy()
data_aggr_labeled = data_aggr.copy()
In [77]:
data_labeled['segment'] = labels.tolist()
data_aggr_labeled['segment'] = labels_aggr.tolist()
In [78]:
data_aggr_labeled['category_name'] = data_aggr.apply(lambda x: 
                                                     data[(data.UserID==x.UserID) & 
                                                          (data.ItemID==x.ItemID)].iloc[0]['category name'], 
                                                     axis=1)
In [79]:
fig = px.scatter(data_aggr_labeled, x="segment", 
             y="avg_rating", color="category_name",
             size="number_of_unique_songs",
             hover_data=['stimulus_points', ])
fig.show()

Dimensionality reduction

Explaining variance

In [80]:
comp_analysis = pd.DataFrame()

comp_analysis['label'] = data_aggr_labeled['segment']

# scaler = MinMaxScaler()
# X_imputed = scaler.fit_transform(data_aggr.drop(['UserID', 'ItemID'], axis=1))
# data_for_pca = pd.DataFrame(X_imputed, columns = data_aggr.drop(['UserID', 'ItemID'], axis=1).columns)

# scaling lowers explained variance

data_for_pca = data_aggr.drop(['UserID', 'ItemID', 'genre_ratio'], axis=1)

pca = PCA(n_components=3)  # n_components - number of components to keep
pca_result = pca.fit_transform(data_for_pca.values)

comp_analysis['pca-one'] = pca_result[:,0]
comp_analysis['pca-two'] = pca_result[:,1]
comp_analysis['pca-three'] = pca_result[:,2]

variance = pca.explained_variance_ratio_
print(f'Explained variance: {variance}')

sv = pca.singular_values_
print(f'Singular values: {sv}')
Explained variance: [9.95765277e-01 1.25955271e-03 6.93512217e-04]
In [81]:
pcv = pca.explained_variance_ratio_
'Total explained variance by {} components is equal to {}'.format(len(pcv), sum(pcv))
Out[81]:
'Total explained variance by 3 components is equal to 0.9977183419073326'
In [82]:
fig = plt.figure(figsize=(13, 10))
ax = fig.add_subplot(111, projection= '3d')

ax.scatter(comp_analysis[comp_analysis.label.isin([1,6,2])]['pca-one'], 
            comp_analysis[comp_analysis.label.isin([1,6,2])]['pca-two'], 
            comp_analysis[comp_analysis.label.isin([1,6,2])]['pca-three'], s=15, color='b')
ax.scatter(comp_analysis[comp_analysis.label.isin([9,4])]['pca-one'], 
            comp_analysis[comp_analysis.label.isin([9,4])]['pca-two'], 
            comp_analysis[comp_analysis.label.isin([9,4])]['pca-three'], s=30, color='g')
ax.scatter(comp_analysis[comp_analysis.label == 3]['pca-one'], 
            comp_analysis[comp_analysis.label == 3]['pca-two'], 
            comp_analysis[comp_analysis.label == 3]['pca-three'], s=40, color='y')
ax.scatter(comp_analysis[comp_analysis.label == 8]['pca-one'], 
            comp_analysis[comp_analysis.label == 8]['pca-two'], 
            comp_analysis[comp_analysis.label == 8]['pca-three'], s=40, color='r')

ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')
Out[82]:
Text(0.5,0,'pca-three')

Minimization of relative entropy

t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. The t-SNE cost function is not convex, i.e. different initializations can produce different results.

n_components - dimension of the embedded space.

perplexity - related to the number of nearest neighbors used in other manifold learning algorithms. Larger datasets usually require a larger perplexity.

early_exaggeration - controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space.
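The final KL divergence reported in the verbose log is also exposed as an attribute after fitting, which is handy for comparing parameter settings. A sketch on random data (not the notebook's features):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # stand-in for the aggregated feature matrix

tsne = TSNE(n_components=2, perplexity=30.0, random_state=42)
emb = tsne.fit_transform(X)

print(emb.shape)                      # (200, 2)
print(round(tsne.kl_divergence_, 3))  # final KL divergence of the fit
```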

In [83]:
tsne_input = data_aggr.drop(['UserID', 'ItemID', 'genre_ratio'], axis=1)

tsne = TSNE(n_components=3, 
            verbose=1, random_state=42, 
            perplexity=28.0, 
            early_exaggeration=12.0,
            n_iter=500)
tsne_results = tsne.fit_transform(tsne_input.values)
[t-SNE] Computing 85 nearest neighbors...
[t-SNE] Indexed 930 samples in 0.002s...
[t-SNE] Computed neighbors for 930 samples in 0.036s...
[t-SNE] Computed conditional probabilities for sample 930 / 930
[t-SNE] Mean sigma: 1.359837
[t-SNE] KL divergence after 250 iterations with early exaggeration: 48.546093
[t-SNE] KL divergence after 500 iterations: 0.453473
In [84]:
df_tsne = pd.DataFrame()

df_tsne['label'] = data_aggr_labeled['segment']

df_tsne['x-tsne'] = tsne_results[:,0]
df_tsne['y-tsne'] = tsne_results[:,1]
df_tsne['z-tsne'] = tsne_results[:,2]
In [85]:
# filter out anomalies
df_tsne = df_tsne[df_tsne['z-tsne'] < 1]
In [86]:
fig = plt.figure(figsize=(13, 10))
ax = fig.add_subplot(111, projection= '3d')

ax.scatter(df_tsne[df_tsne.label.isin([1,6,2])]['x-tsne'], 
            df_tsne[df_tsne.label.isin([1,6,2])]['y-tsne'], 
            df_tsne[df_tsne.label.isin([1,6,2])]['z-tsne'], s=15, color='b')
ax.scatter(df_tsne[df_tsne.label.isin([9,4])]['x-tsne'], 
            df_tsne[df_tsne.label.isin([9,4])]['y-tsne'], 
            df_tsne[df_tsne.label.isin([9,4])]['z-tsne'], s=30, color='g')
ax.scatter(df_tsne[df_tsne.label == 3]['x-tsne'], 
            df_tsne[df_tsne.label == 3]['y-tsne'], 
            df_tsne[df_tsne.label == 3]['z-tsne'], s=40, color='y')
ax.scatter(df_tsne[df_tsne.label == 8]['x-tsne'], 
            df_tsne[df_tsne.label == 8]['y-tsne'], 
            df_tsne[df_tsne.label == 8]['z-tsne'], s=40, color='r')

ax.set_xlabel('x-tsne')
ax.set_ylabel('y-tsne')
ax.set_zlabel('z-tsne')
Out[86]:
Text(0.5,0,'z-tsne')

Association rule learning (Apriori)

Association rule mining is a technique for identifying underlying relations between items. It is typically used to predict which items will be bought together: in a supermarket, where customers can buy a variety of items, there is usually a pattern in what they buy. We apply the same technique to predict which songs will be listened to together during a single drive.

In [87]:
songs = []

t_grouped = data_aggr.groupby(['UserID', ])

for group_name, group in t_grouped:
    items = []
    for row_index, row in group.iterrows():
        items.append(music_track[music_track.ItemID == row['ItemID']].iloc[0].title)
    songs.append(items)

There are three major components of Apriori algorithm:

Support
Confidence
Lift
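For two items A and B these metrics reduce to simple ratios over the transactions. A minimal sketch with made-up transactions (not the notebook's data):

```python
# hypothetical transactions: each set is one user's listened songs
transactions = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"}]
n = len(transactions)

support_ab = sum(1 for t in transactions if {"A", "B"} <= t) / n
support_a = sum(1 for t in transactions if "A" in t) / n
support_b = sum(1 for t in transactions if "B" in t) / n

confidence = support_ab / support_a  # P(B | A)
lift = confidence / support_b        # how much A boosts B over its baseline

print(support_ab, confidence, lift)  # 0.4 0.666... 1.111...
```

A lift above 1 means the two items co-occur more often than independence would predict.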
In [88]:
# note: apyori silently ignores unrecognized keyword arguments,
# so min_length has no effect here; max_length would cap the rule size
results = apriori(songs, min_confidence=0.75, min_lift=5, min_length=2)

A minimum lift of 5 means the right-hand item must be at least 5 times more likely to be chosen by users who chose the left-hand item than it is overall.

In [89]:
apriori_results = itertools.islice(results, 5)

Example results (rules)

In [90]:
association_rules = list(apriori_results)

for item in association_rules:
    # item[0] is the frozenset of items in the rule
    # (iteration order is not guaranteed)
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # item[1] is the support of the itemset
    print("Support: " + str(item[1]))

    # item[2][0] is the first ordered statistic:
    # index [2] is the confidence, index [3] is the lift
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")
Rule: Neopolitan Dreams -> Alles kann besser werde
Support: 0.11904761904761904
Confidence: 0.8333333333333334
Lift: 5.000000000000001
=====================================
Rule: Alles kann besser werde -> Secrets
Support: 0.11904761904761904
Confidence: 0.8333333333333334
Lift: 5.000000000000001
=====================================
Rule: Alles wird gut - Album Version -> I Can't Quit You Baby
Support: 0.11904761904761904
Confidence: 0.8333333333333334
Lift: 5.000000000000001
=====================================
Rule: Hypnotize -> Bad Romance
Support: 0.11904761904761904
Confidence: 0.8333333333333334
Lift: 5.000000000000001
=====================================
Rule: Pursuit Of Vikings -> Bad Romance
Support: 0.14285714285714285
Confidence: 0.8571428571428571
Lift: 5.142857142857143
=====================================

Export data to file

CSV file

Detailed (un-aggregated) set

Saving the one-hot-encoded, context-aware dataset to a comma-separated file, which can later be read by pandas in a different notebook.

In [91]:
data_encoded.to_csv('dataset.csv', sep=",", index=False, header=True)

Aggregated set

Saving the one-hot-encoded, aggregated dataset to a comma-separated file.

In [92]:
data_aggr.to_csv('dataset_aggr.csv', sep=",", index=False, header=True)

Pickle file

We pickle both datasets to binary pickle files.

In [93]:
pickle.dump(data_encoded, file=open("dataset.pickle", 'wb'))
pickle.dump(data_aggr, file=open("dataset_aggr.pickle", 'wb'))
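To restore the datasets in another notebook, the files can be read back with either the pickle module or the pandas shortcut. A round-trip sketch on a toy frame (hypothetical stand-in for `data_encoded`):

```python
import pickle
import pandas as pd

# toy frame standing in for data_encoded (hypothetical)
toy = pd.DataFrame({"UserID": [1, 2], "Rating": [4, 5]})
toy.to_pickle("toy.pickle")

# restoring in another notebook: either the pickle module...
with open("toy.pickle", "rb") as f:
    restored = pickle.load(f)

# ...or the pandas shortcut
restored2 = pd.read_pickle("toy.pickle")

print(restored.equals(toy), restored2.equals(toy))  # True True
```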

Other data

In [94]:
# song - category name mapping
data[['ItemID', 'category_name']].drop_duplicates().to_csv('music_type.csv', sep=",", index=False, header=True)